1. Data source: The data is available on Kaggle and was originally sourced from an anonymous US insurance company.
2. What the data are: The dataset contains information on insurance policyholders, including variables such as age, sex, BMI, number of children, smoker status, region, and charges (the amount charged by the insurance company for health coverage). Some of these variables are categorical (e.g., sex, smoker status, region) and some are numerical (e.g., age, BMI, charges).
3. Sample size (n) and number of variables (k): The version of the dataset used here has 1,268 rows (policyholders) and 7 columns (variables).
4. Why the data are of interest: The data can be used to explore various questions related to insurance policy holders, such as whether certain demographic groups are more likely to smoke, whether there is a relationship between BMI and charges, and whether charges vary by region. This information can be useful for insurance companies in setting prices and designing policies, as well as for researchers studying health outcomes and disparities.
5. Disclaimer: The response variable in this dataset is highly skewed, and with our current knowledge there may not be enough information to fully correct the model through transformation.
Can we predict the charges of an insurance policyholder based on their age, gender, BMI, number of children, smoker status, and region of residence?
library(readxl)
library(car)
## Loading required package: carData
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
insurance <- read_excel("insurance.xlsx")
insurance$smoker <- relevel(factor(insurance$smoker), ref="yes")
This dataset is already complete and clean, so little reorganizing needs to be done. Every predictor plausibly contributes to predicting total charges.
full_model <- lm(charges~. , data = insurance)
summary(full_model)
##
## Call:
## lm(formula = charges ~ ., data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11527.3 -2887.5 -995.5 1422.3 29711.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11909.72 1072.11 11.109 < 2e-16 ***
## age 259.34 12.35 20.999 < 2e-16 ***
## sexmale -30.15 345.34 -0.087 0.93045
## bmi 337.45 29.62 11.394 < 2e-16 ***
## children 466.98 142.91 3.268 0.00111 **
## smokerno -23954.39 426.63 -56.148 < 2e-16 ***
## regionnorthwest -318.01 496.36 -0.641 0.52185
## regionsoutheast -913.78 495.21 -1.845 0.06524 .
## regionsouthwest -899.42 492.25 -1.827 0.06791 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6121 on 1259 degrees of freedom
## Multiple R-squared: 0.75, Adjusted R-squared: 0.7484
## F-statistic: 472.1 on 8 and 1259 DF, p-value: < 2.2e-16
This step includes checking for data-entry mistakes and coding the qualitative variables as indicator variables.
str(insurance)
## tibble [1,268 × 7] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:1268] 19 18 33 32 31 46 37 60 25 62 ...
## $ sex : chr [1:1268] "female" "male" "male" "male" ...
## $ bmi : num [1:1268] 27.9 33.8 22.7 28.9 25.7 ...
## $ children: num [1:1268] 0 1 0 0 0 1 3 0 0 0 ...
## $ smoker : Factor w/ 2 levels "yes","no": 1 2 2 2 2 2 2 2 2 1 ...
## $ region : chr [1:1268] "southwest" "southeast" "northwest" "northwest" ...
## $ charges : num [1:1268] 16885 1726 21984 3867 3757 ...
There is no missing data in this data set, and we do not need to recode the categorical variables by hand because R creates indicator variables for them automatically when we run lm() later.
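As a quick sanity check (a sketch, using the `insurance` tibble loaded above), the completeness claim can be verified directly:

```r
# Completeness check: anyNA() scans the whole data frame at once,
# colSums(is.na(...)) gives per-column NA counts.
anyNA(insurance)           # FALSE when no values are missing
colSums(is.na(insurance))  # all zeros for a complete data set
```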
Summaries and plots give a feel for the data set; they can also flag potential data-entry mistakes and suggest variable transformations.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
insurance_num <- select_if(insurance, is.numeric)
pairs(insurance_num)
Above are the scatter plots for all pairs of numeric variables.
Use each of the potential quantitative predictors and indicator variables for the qualitative predictors. Use untransformed Y as the response variable unless a transformed Y is suggested beforehand [e.g., highly skewed Y is often better analyzed as \(log_e(y)\)]. If there is background knowledge which suggests that particular transformations or interactions might be important, then include them in this initial model.
summary(full_model)
##
## Call:
## lm(formula = charges ~ ., data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11527.3 -2887.5 -995.5 1422.3 29711.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11909.72 1072.11 11.109 < 2e-16 ***
## age 259.34 12.35 20.999 < 2e-16 ***
## sexmale -30.15 345.34 -0.087 0.93045
## bmi 337.45 29.62 11.394 < 2e-16 ***
## children 466.98 142.91 3.268 0.00111 **
## smokerno -23954.39 426.63 -56.148 < 2e-16 ***
## regionnorthwest -318.01 496.36 -0.641 0.52185
## regionsoutheast -913.78 495.21 -1.845 0.06524 .
## regionsouthwest -899.42 492.25 -1.827 0.06791 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6121 on 1259 degrees of freedom
## Multiple R-squared: 0.75, Adjusted R-squared: 0.7484
## F-statistic: 472.1 on 8 and 1259 DF, p-value: < 2.2e-16
If there is a strong suggestion that one or more of the assumptions is violated, then proceed to step 7; otherwise, if everything checks out, proceed to step 8.
par(mfrow=c(2,3))
plot(full_model)
plot(full_model$residuals, main = "Residuals vs Order Plot")
Linearity: The relationship between the predictors and the response should be linear. The Residuals vs Fitted plot shows a cone-shaped pattern, so linearity may be violated.
Normality: The residuals should be normally distributed. The QQ plot curves upward at the right tail, indicating a non-normal distribution, so normality may be violated.
Constant Variance: The variance of the residuals should be constant across all levels of the predictors. In the Residuals vs Fitted plot, the spread of the residuals increases as the fitted values increase, so this assumption also appears violated.
Independence: The residuals should be independent of each other. There does not appear to be a pattern in the Residuals vs Order plot, so independence looks fine.
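The visual checks can be backed by formal tests. Below is a minimal sketch on simulated heteroscedastic data (swap in `full_model` for the report's own fit); the auxiliary regression is a simplified Breusch-Pagan-style check, not the version from any particular package:

```r
set.seed(1)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = x)   # error spread grows with x (cone shape)
m <- lm(y ~ x)

# Normality of residuals
shapiro.test(residuals(m))

# Non-constant variance: regress squared residuals on fitted values;
# a small p-value on the slope flags heteroscedasticity
aux <- lm(residuals(m)^2 ~ fitted(m))
summary(aux)$coefficients["fitted(m)", "Pr(>|t|)"]
```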
According to the p-values in the full model, we may suspect that \(sex\) and \(region\) do not have a strong influence on \(charges\). We can run a partial F-test to check.
reduced <- lm(charges ~ age + bmi + children + factor(smoker), data = insurance)
summary(reduced)
##
## Call:
## lm(formula = charges ~ age + bmi + children + factor(smoker),
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12036.1 -2936.3 -972.8 1381.6 29342.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11739.77 1028.49 11.41 < 2e-16 ***
## age 260.05 12.34 21.07 < 2e-16 ***
## bmi 323.06 28.37 11.39 < 2e-16 ***
## children 462.83 142.83 3.24 0.00122 **
## factor(smoker)no -23921.33 424.34 -56.37 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6123 on 1263 degrees of freedom
## Multiple R-squared: 0.749, Adjusted R-squared: 0.7482
## F-statistic: 942.3 on 4 and 1263 DF, p-value: < 2.2e-16
anova(reduced, full_model)
Hypothesis:
\(H_0:\) \(\beta_2\) = \(\beta_6\) = \(\beta_7\) = \(\beta_8\) = 0
\(H_a:\) At least one of the \(\beta's\) is not 0
Decision Rule:
Reject \(H_0\) if p-value < 0.05
Fail to reject \(H_0\) if p-value is \(\ge\) 0.05
Test Statistic:
p-value = 0.1654
Decision/Conclusion:
Since the p-value (0.1654) is greater than 0.05, we fail to reject the null hypothesis. In other words, we can use the reduced model without \(sex\) and \(region\).
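For reference, the test statistic behind this anova() comparison is the partial F statistic:

\[ F = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F} \]

where the subscripts \(R\) and \(F\) denote the reduced and full models and the \(df\) are the error degrees of freedom; under \(H_0\), this follows an \(F\) distribution with \(df_R - df_F\) and \(df_F\) degrees of freedom.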
hist(insurance$charges)
Since \(charges\) is highly right-skewed, a log transformation of the response is a natural candidate. We will run a Box-Cox analysis to determine whether a log transformation of y is truly warranted.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:olsrr':
##
## cement
bc <-boxcox(reduced)
lambda <- bc$x[which.max(bc$y)]
lambda
## [1] 0.1414141
With \(\hat{\lambda} \approx 0.14\), which is close to 0, a log transformation of the response is appropriate. We will use the logged response from here on out.
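For reference, the Box-Cox family of transformations (a standard definition, not specific to this report) is

\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log_e(y), & \lambda = 0 \end{cases} \]

so a \(\hat{\lambda}\) near 0, as here, corresponds to the log transformation.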
log_reponse <- lm(I(log(charges)) ~ age + bmi + factor(smoker) + children, data = insurance)
summary(log_reponse)
##
## Call:
## lm(formula = I(log(charges)) ~ age + bmi + factor(smoker) + children,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11185 -0.19741 -0.04754 0.06978 2.07769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.5311400 0.0755440 112.929 < 2e-16 ***
## age 0.0349574 0.0009067 38.556 < 2e-16 ***
## bmi 0.0105818 0.0020839 5.078 4.39e-07 ***
## factor(smoker)no -1.5515840 0.0311681 -49.781 < 2e-16 ***
## children 0.1000297 0.0104910 9.535 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4497 on 1263 degrees of freedom
## Multiple R-squared: 0.7628, Adjusted R-squared: 0.7621
## F-statistic: 1016 on 4 and 1263 DF, p-value: < 2.2e-16
par(mfrow=c(2,3))
plot(log_reponse)
hist(log_reponse$residuals)
plot(log_reponse$residuals)
After logging just the response, the Residuals vs Fitted plot still shows an obvious curvature, so linearity remains a problem. However, the spread of the residuals has become more even, so equal variance has improved. Normality is still a problem, and independence is still not violated.
pairs(insurance_num)
Since we know that we must log the response, we can test which predictor variables should also be logged. According to the scatter plot matrix, we may need to log \(children\), and perhaps \(bmi\) as well, since their points are clustered together in a narrow range.
vif(reduced)
## age bmi children factor(smoker)
## 1.014327 1.011952 1.001579 1.001196
We can also compute VIFs; as shown, multicollinearity will not be a problem in this case.
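For reference, the variance inflation factor for predictor \(j\) is

\[ VIF_j = \frac{1}{1 - R_j^2} \]

where \(R_j^2\) is the R-squared from regressing predictor \(j\) on the remaining predictors. Values near 1, as here, indicate little multicollinearity; values above 5 or 10 are common warning thresholds.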
Note: we have to use \(log(children + 1)\) because we cannot take the log of 0, and some of the values of children are 0.
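A small illustration of why the shift is needed (R's log1p() computes \(log(1+x)\) and could be used equivalently):

```r
log(0)      # -Inf: the log of zero is not a finite number
log(0 + 1)  # 0: the +1 shift maps zero children to 0
log1p(2)    # identical to log(2 + 1)
```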
logged_children <- lm(I(log(charges)) ~ age + bmi + factor(smoker) +
I(log(children + 1)), data = insurance)
summary(logged_children)
##
## Call:
## lm(formula = I(log(charges)) ~ age + bmi + factor(smoker) + I(log(children +
## 1)), data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09145 -0.20116 -0.05053 0.07288 2.08333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.5161293 0.0755772 112.681 < 2e-16 ***
## age 0.0348967 0.0009052 38.553 < 2e-16 ***
## bmi 0.0104753 0.0020801 5.036 5.44e-07 ***
## factor(smoker)no -1.5502161 0.0311097 -49.831 < 2e-16 ***
## I(log(children + 1)) 0.2219038 0.0226404 9.801 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4489 on 1263 degrees of freedom
## Multiple R-squared: 0.7637, Adjusted R-squared: 0.763
## F-statistic: 1021 on 4 and 1263 DF, p-value: < 2.2e-16
par(mfrow=c(2,3))
plot(logged_children)
plot(logged_children$residuals)
Logging children improved the adjusted R-squared slightly, and linearity and equal spread improved as well, though they are still not ideal. We can try logging \(bmi\) to see if it improves the model further.
logged_bmi <- lm(I(log(charges)) ~ age + I(log(bmi)) + factor(smoker) +
children, data = insurance)
summary(logged_bmi)
##
## Call:
## lm(formula = I(log(charges)) ~ age + I(log(bmi)) + factor(smoker) +
## children, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06754 -0.19480 -0.05033 0.07082 2.07333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7097262 0.2148210 35.889 < 2e-16 ***
## age 0.0348984 0.0009062 38.509 < 2e-16 ***
## I(log(bmi)) 0.3375252 0.0628408 5.371 9.31e-08 ***
## factor(smoker)no -1.5520986 0.0311298 -49.859 < 2e-16 ***
## children 0.1000456 0.0104786 9.548 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4492 on 1263 degrees of freedom
## Multiple R-squared: 0.7634, Adjusted R-squared: 0.7626
## F-statistic: 1019 on 4 and 1263 DF, p-value: < 2.2e-16
par(mfrow=c(2,3))
plot(logged_bmi)
plot(logged_bmi$residuals)
After logging only BMI, the assumption plots are still very similar, and the adjusted R-sq is very close to what we had before.
logged_children_bmi <- lm(I(log(charges)) ~ age + I(log(bmi)) + factor(smoker) + I(log(children + 1)), data = insurance)
summary(logged_children_bmi)
##
## Call:
## lm(formula = I(log(charges)) ~ age + I(log(bmi)) + factor(smoker) +
## I(log(children + 1)), data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.04787 -0.20324 -0.05391 0.07524 2.09095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7013872 0.2144374 35.914 < 2e-16 ***
## age 0.0348375 0.0009047 38.506 < 2e-16 ***
## I(log(bmi)) 0.3346014 0.0627231 5.335 1.13e-07 ***
## factor(smoker)no -1.5507229 0.0310711 -49.909 < 2e-16 ***
## I(log(children + 1)) 0.2219591 0.0226129 9.816 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4483 on 1263 degrees of freedom
## Multiple R-squared: 0.7643, Adjusted R-squared: 0.7636
## F-statistic: 1024 on 4 and 1263 DF, p-value: < 2.2e-16
par(mfrow=c(2,3))
plot(logged_children_bmi)
plot(logged_children_bmi$residuals)
Logging both \(children\) and \(bmi\) improved the adjusted R-squared slightly; the diagnostic plots look very similar to before. We will continue with this model, as it has the highest adjusted R-squared of the three log transformations.
Now that we have the log transformation, we would like to test certain interaction terms to see whether they are significant for predicting charges.
interaction_model <- lm(charges ~ age + bmi + children + factor(smoker) + bmi:smoker, data = insurance)
summary(interaction_model)
##
## Call:
## lm(formula = charges ~ age + bmi + children + factor(smoker) +
## bmi:smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14317 -1957 -1338 -469 29796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22573.112 1553.579 -14.530 < 2e-16 ***
## age 268.199 9.955 26.942 < 2e-16 ***
## bmi 1426.240 48.036 29.691 < 2e-16 ***
## children 499.413 115.135 4.338 1.55e-05 ***
## factor(smoker)no 19876.453 1711.631 11.613 < 2e-16 ***
## bmi:smokerno -1424.556 54.549 -26.115 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4935 on 1262 degrees of freedom
## Multiple R-squared: 0.8371, Adjusted R-squared: 0.8364
## F-statistic: 1297 on 5 and 1262 DF, p-value: < 2.2e-16
par(mfrow=c(2,3))
plot(interaction_model)
plot(interaction_model$residuals)
With the addition of the interaction term, we get an R-squared of 0.8371, the highest we have obtained, along with a fairly high adjusted R-squared. Although we cannot compare the logged and unlogged models directly, the diagnostic plots for the interaction model are much better than those for the logged one, so we will use it as our final model.
cd <- cooks.distance(interaction_model)
plot(cd)
abline(a=0.5, b=0, col="red")
abline(a=-0.5, b=0, col="red")
According to Cook's distance, none of the observations exceeds 0.5, which means we do not need to remove any observations from the dataset; none is influential enough to warrant it.
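For reference, Cook's distance for observation \(i\) combines the standardized residual and the leverage:

\[ D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}} \]

where \(r_i\) is the standardized residual, \(h_{ii}\) the leverage, and \(p\) the number of model coefficients. Since \(D_i \ge 0\), only the upper reference line at 0.5 is strictly needed.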
st_red <- rstandard(interaction_model)
plot(st_red)
st_red[st_red > 2]
## 3 8 32 59 98 110 123 132
## 3.205346 3.144250 2.815317 3.078631 3.889590 3.413665 4.174042 2.838946
## 133 135 210 218 232 234 277 278
## 4.042542 2.439497 4.324639 2.296324 4.577266 2.564456 2.680335 2.970067
## 291 305 323 337 360 367 376 407
## 2.904730 3.729795 3.077686 3.025589 2.540102 3.759814 2.759557 4.187226
## 419 443 448 451 464 488 491 495
## 2.963111 3.638697 2.213716 2.161305 2.195029 4.972269 3.022320 2.412964
## 496 503 508 522 540 543 548 551
## 4.182876 2.419813 3.200366 2.804840 3.476956 3.663617 2.272447 2.804164
## 562 599 619 648 655 712 728 763
## 4.294331 3.482870 3.175303 3.203065 3.370333 2.304717 2.583914 4.035924
## 775 813 830 871 878 888 910 914
## 3.696845 2.769808 3.072564 2.631893 2.751626 4.452207 3.596641 2.872639
## 929 932 935 951 955 958 964 971
## 2.676080 2.372797 3.735370 2.232223 3.691846 4.242893 4.470659 3.677156
## 982 1041 1046 1064 1074 1081 1084 1095
## 3.866506 2.382366 2.665465 2.799669 3.003317 3.212294 2.499007 2.013193
## 1100 1132 1142 1146 1150 1164 1191 1232
## 2.648571 3.025402 4.612277 2.990084 2.173118 3.654485 3.340951 6.051660
## 1250 1259
## 2.185366 3.630167
abline(a=2, b=0, col="red")
abline(a=-2, b=0, col="red")
From the standardized residuals, we can see that there are several points with standardized residuals greater than 2 in absolute value. This could be a problem, as these points are unusual.
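Some exceedances are expected by chance alone: under normality, roughly 2.3% of standardized residuals fall above +2. With n = 1268 (from dim(insurance) below), that baseline can be computed directly:

```r
# Expected count of standardized residuals above +2 under normality
n <- 1268
round(n * pnorm(-2))   # about 29; noticeably fewer than the 80-odd flagged above
```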
lev <- hatvalues(interaction_model)
plot(lev)
dim(insurance)
## [1] 1268 7
abline(a=18/1268, b=0, col="red")  # 3(k+1)/n cutoff with 6 coefficients, n = 1268
There are also several points with high leverage.
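The reference line uses the common 3(k+1)/n rule of thumb; with six coefficients in the interaction model and n = 1268, the cutoff works out to:

```r
p <- 6                  # coefficients in interaction_model, incl. intercept
n <- 1268               # rows, from dim(insurance)
round(3 * p / n, 4)     # 0.0142: leverage above this is worth a closer look
```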
Linearity and equal spread have both been improved, but normality is still a problem. Every predictor is significant in the summary.
summary(interaction_model)
##
## Call:
## lm(formula = charges ~ age + bmi + children + factor(smoker) +
## bmi:smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14317 -1957 -1338 -469 29796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22573.112 1553.579 -14.530 < 2e-16 ***
## age 268.199 9.955 26.942 < 2e-16 ***
## bmi 1426.240 48.036 29.691 < 2e-16 ***
## children 499.413 115.135 4.338 1.55e-05 ***
## factor(smoker)no 19876.453 1711.631 11.613 < 2e-16 ***
## bmi:smokerno -1424.556 54.549 -26.115 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4935 on 1262 degrees of freedom
## Multiple R-squared: 0.8371, Adjusted R-squared: 0.8364
## F-statistic: 1297 on 5 and 1262 DF, p-value: < 2.2e-16
According to this model, linearity and equal spread have improved over the original model. Independence is also not a problem, as there is no visible pattern in the residuals vs index plot. However, as before, normality is still a problem.
Normality may not be completely fixable given the nature of the sample. A disproportionate number of observations have charges under $20,000, which makes the histogram strongly right-skewed. This disparity makes sense given the nature of the healthcare industry: very few individuals incur charges greater than $20,000, or even $10,000.
However, since we are using the model for prediction, the assumption violations matter less for that purpose. Our predictions are fairly close to the observed values, with some points predicted better than others. Thus, in the end, our model is able to predict charges to a fair degree.
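As a sketch of prediction in practice (the policyholder values below are hypothetical, and the call assumes the interaction_model fitted above):

```r
# Predicted charges, with a 95% prediction interval, for a hypothetical
# 40-year-old non-smoker with BMI 30 and two children
new_person <- data.frame(age = 40, bmi = 30, children = 2, smoker = "no")
predict(interaction_model, newdata = new_person, interval = "prediction")
```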
First, we will insert all of the plausible two-way interaction terms, and then use backwards elimination to select the best model.
interaction_model <- lm(charges ~ age + bmi + factor(smoker) + children +
bmi:smoker + age:bmi + age:smoker + age:children +
bmi:children + children:smoker, data = insurance)
ols_step_backward_p(interaction_model)
##
##
## Elimination Summary
## ---------------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## ---------------------------------------------------------------------------------
## 1 age:smoker 0.8376 0.8364 9.0086 25176.9612 4935.3016
## 2 bmi:children 0.8376 0.8365 7.0447 25174.9977 4933.4122
## 3 age:children 0.8376 0.8367 5.1009 25173.0543 4931.5642
## ---------------------------------------------------------------------------------
Backwards elimination removed \(age:smoker\), \(bmi:children\), and \(age:children\). We also drop the remaining candidate interactions \(age:bmi\) and \(children:smoker\), keeping only \(bmi:smoker\). We can then conduct a partial F-test to see whether the reduced model is adequate.
best_model <- lm(charges ~ age + bmi + factor(smoker) + children +
bmi:smoker, data = insurance)
summary(best_model)
##
## Call:
## lm(formula = charges ~ age + bmi + factor(smoker) + children +
## bmi:smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14317 -1957 -1338 -469 29796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22573.112 1553.579 -14.530 < 2e-16 ***
## age 268.199 9.955 26.942 < 2e-16 ***
## bmi 1426.240 48.036 29.691 < 2e-16 ***
## factor(smoker)no 19876.453 1711.631 11.613 < 2e-16 ***
## children 499.413 115.135 4.338 1.55e-05 ***
## bmi:smokerno -1424.556 54.549 -26.115 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4935 on 1262 degrees of freedom
## Multiple R-squared: 0.8371, Adjusted R-squared: 0.8364
## F-statistic: 1297 on 5 and 1262 DF, p-value: < 2.2e-16
anova(best_model, interaction_model)
Hypothesis:
\(H_0:\) \(\beta_6\) = \(\beta_7\) = \(\beta_8\) = \(\beta_9\) = \(\beta_{10}\) = 0
\(H_a:\) At least one of the \(\beta's\) is not 0
Decision Rule:
Reject \(H_0\) if p-value < 0.05
Fail to reject \(H_0\) if p-value is \(\ge\) 0.05
Test Statistic:
p-value = 0.9926
Decision/Conclusion:
The partial F-test returned a p-value of 0.9926, so we fail to reject the null hypothesis; the dropped interaction terms are not needed, and we are able to use the reduced model.